{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Reference: https://jupyterbook.org/interactive/hiding.html\n",
    "# Use {hide, remove}-{input, output, cell} tags to hiding content\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "%matplotlib inline\n",
    "import ipywidgets as widgets\n",
    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
    "from IPython.display import display\n",
    "\n",
    "sns.set()\n",
    "sns.set_context('talk')\n",
    "np.set_printoptions(threshold=20, precision=2, suppress=True)\n",
    "pd.set_option('display.max_rows', 7)\n",
    "pd.set_option('display.max_columns', 8)\n",
    "pd.set_option('precision', 2)\n",
    "# This option stops scientific notation for pandas\n",
    "# pd.set_option('display.float_format', '{:.2f}'.format)\n",
    "\n",
    "def display_df(df, rows=pd.options.display.max_rows,\n",
    "               cols=pd.options.display.max_columns):\n",
    "    with pd.option_context('display.max_rows', rows,\n",
    "                           'display.max_columns', cols):\n",
    "        display(df)"
   ],
   "outputs": [],
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Exercises"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Write a SQL query on the `baby` relation, read the result into a `pandas`\n",
    "  dataframe, and create a plot of how popular your name was over time. If you\n",
    "  used that plot to make a guess at your age, what would you guess? Is that\n",
    "  close to your actual age? Think of a potential reason."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- When plotting male and female baby names over time, you might notice that\n",
    "  after 1950 there are generally more male babies. Is this trend reflected in\n",
    "  the U.S. census data? Go to the Census website\n",
    "  (https://data.census.gov/cedsci/) and check."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Let's say you want to find the most popular male and female baby name each\n",
    "  year. You might write this query:\n",
    "\n",
    "  ```sql\n",
    "  SELECT Year, Sex, MAX(Count), MAX(Name)\n",
    "  FROM baby\n",
    "  GROUP BY Year, Sex\n",
    "  ```\n",
    "\n",
    "  But this code doesn't produce the right result. Why?\n",
    "\n",
    "  Now, write code to produce the most popular male and female name each year,\n",
    "  along with its count. *Hint:* The answer is a bit simpler than you might\n",
    "  expect because of a special SQLite rule (search for \"Bare columns in an\n",
    "  aggregate queries\"). "
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Come up with a realistic data example where a data scientist would prefer an\n",
    "  inner join to a left join, and an example where a data scientist would\n",
    "  prefer a left join to an inner join."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- In the section on Joins, the `nyt` table doesn't have any duplicate names.\n",
    "  But a name could feasibly belong to multiple categories---for instance,\n",
    "  `Elizabeth` is a name from the Bible and a name for royalty. Let's say we\n",
    "  have a relation called `multi_cat` that can list a name multiple times---once\n",
    "  for each category it belongs to:"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "source": [
    "query = ''' \n",
    "SELECT *\n",
    "FROM multi_cat\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nyt_name</th>\n",
       "      <th>category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Elizabeth</td>\n",
       "      <td>bible</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Elizabeth</td>\n",
       "      <td>royal</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Arjun</td>\n",
       "      <td>hindu</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Arjun</td>\n",
       "      <td>mythological</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    nyt_name      category\n",
       "0  Elizabeth         bible\n",
       "1  Elizabeth         royal\n",
       "2      Arjun         hindu\n",
       "3      Arjun  mythological"
      ]
     },
     "metadata": {},
     "execution_count": 70
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "What happens when we join `baby` with this table? In general, what happens when\n",
    "there are *multiple rows* that match in both left and right tables?"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- In a *self-join*, we take a table and join it with itself. For example, the\n",
    "  `friends` relation contains pairs of people who are friends with each other."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "source": [
    "query = ''' \n",
    "SELECT *\n",
    "FROM friends\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>self</th>\n",
       "      <th>other</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jim</td>\n",
       "      <td>Scott</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Philip</td>\n",
       "      <td>Tricia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Philip</td>\n",
       "      <td>Ailie</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     self   other\n",
       "0     Jim   Scott\n",
       "1   Scott  Philip\n",
       "2  Philip  Tricia\n",
       "3  Philip   Ailie"
      ]
     },
     "metadata": {},
     "execution_count": 73
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Why might a data scientist find the following self-join useful?"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "source": [
    "query = ''' \n",
    "SELECT *\n",
    "FROM friends AS f1\n",
    "  INNER JOIN friends AS f2\n",
    "  ON f1.other = f2.self\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>self</th>\n",
       "      <th>other</th>\n",
       "      <th>self</th>\n",
       "      <th>other</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jim</td>\n",
       "      <td>Scott</td>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Ailie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Tricia</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    self   other    self   other\n",
       "0    Jim   Scott   Scott  Philip\n",
       "1  Scott  Philip  Philip   Ailie\n",
       "2  Scott  Philip  Philip  Tricia"
      ]
     },
     "metadata": {},
     "execution_count": 75
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- The self-join query in the previous exercise uses the `AS` keyword to rename\n",
    "  each relation. Why do we need to rename the relations in that query? In\n",
    "  general, under what circumstances do we need to rename relations in the\n",
    "  `FROM` clause? "
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Have names become longer on average over time? Write a SQL query, read the\n",
    "  result into a `pandas` dataframe, then produce a plot to answer this\n",
    "  question. "
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- What does the following SQL query compute? What does the `stat` column\n",
    "  contain?"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "source": [
    "query = ''' \n",
    "WITH yearly_avgs AS (\n",
    "  SELECT Name, AVG(Count) AS avg\n",
    "  FROM baby\n",
    "  GROUP BY Name\n",
    "),\n",
    "sq_diffs AS (\n",
    "  SELECT b.Name, POWER(Count - avg, 2) AS sq_diff\n",
    "  FROM baby AS b JOIN yearly_avgs AS y\n",
    "    ON b.Name = y.Name\n",
    ")\n",
    "SELECT Name, POWER(AVG(sq_diff), 0.5) AS stat\n",
    "FROM sq_diffs\n",
    "GROUP BY Name\n",
    "ORDER BY stat DESC\n",
    "LIMIT 10\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>stat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Michael</td>\n",
       "      <td>28296.71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Robert</td>\n",
       "      <td>26442.69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>James</td>\n",
       "      <td>26434.03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Richard</td>\n",
       "      <td>15849.65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Patricia</td>\n",
       "      <td>13428.25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Matthew</td>\n",
       "      <td>13165.15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        Name      stat\n",
       "0    Michael  28296.71\n",
       "1     Robert  26442.69\n",
       "2      James  26434.03\n",
       "..       ...       ...\n",
       "7    Richard  15849.65\n",
       "8   Patricia  13428.25\n",
       "9    Matthew  13165.15\n",
       "\n",
       "[10 rows x 2 columns]"
      ]
     },
     "metadata": {},
     "execution_count": 95
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- In this chapter we found that you could make reasonable guesses at a person's\n",
    "  age just by knowing their name. For instance, the name \"Luna\" has sharply\n",
    "  risen in popularity after 2000, so you could guess that a person named \"Luna\"\n",
    "  was born around after 2000. Can you make reasonable guesses at a person's age\n",
    "  just from the *first letter* of their name?\n",
    "  \n",
    "  Answer this question by writing a SQL query, reading the result into a\n",
    "  `pandas` dataframe, then producing plots to see whether this is possible.\n",
    "  Then, see which first letters provide the most information about a person's\n",
    "  age. "
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.4",
   "mimetype": "text/x-python",
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "pygments_lexer": "ipython3",
   "nbconvert_exporter": "python",
   "file_extension": ".py"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}